Thinking About Data

Dave Clark

Binghamton University

January 10, 2025

Get to know your data: explore, plot, describe

code
library(tidyverse)  # %>%, pivot_longer(), filter(), ggplot()
library(knitr)      # kable()

# Anscombe's quartet: reshape from wide (x1..x4, y1..y4) to long (set, x, y)
longanscombe <- anscombe %>%
  pivot_longer(cols = everything(),            # pivot all the columns
               cols_vary = "slowest",          # keep the datasets together
               names_to = c(".value", "set"),  # .value = stem of the variable name (x or y)
               names_pattern = "(.)(.)")       # first character = stem, second = set number


kable(
  list(anscombe),
  caption = "Anscombe's Quartet",
  booktabs = TRUE,
  valign = 't',
  row.names = FALSE
)
Anscombe’s Quartet
x1 x2 x3 x4 y1 y2 y3 y4
10 10 10 8 8.04 9.14 7.46 6.58
8 8 8 8 6.95 8.14 6.77 5.76
13 13 13 8 7.58 8.74 12.74 7.71
9 9 9 8 8.81 8.77 7.11 8.84
11 11 11 8 8.33 9.26 7.81 8.47
14 14 14 8 9.96 8.10 8.84 7.04
6 6 6 8 7.24 6.13 6.08 5.25
4 4 4 19 4.26 3.10 5.39 12.50
12 12 12 8 10.84 9.13 8.15 5.56
7 7 7 8 4.82 7.26 6.42 7.91
5 5 5 8 5.68 4.74 5.73 6.89
code
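# one bivariate regression per dataset; collect the intercept (b0) and slope (b1)
B <- data.frame(set = numeric(0), b0 = numeric(0), b1 = numeric(0))
for (i in unique(longanscombe$set)) {  # loop over the four sets, not over every row
    m <- lm(y ~ x, data = longanscombe %>% filter(set == i))
    B[nrow(B) + 1, ] <- list(as.numeric(i), coef(m)[1], coef(m)[2])
}


kable(
  list(B),
  caption = "Anscombe's Quartet: OLS Estimates",
  booktabs = TRUE,
  valign = 't',
  row.names = FALSE
)
Anscombe’s Quartet: OLS Estimates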
set b0 b1
1 3.000091 0.5000909
2 3.000909 0.5000000
3 3.002454 0.4997273
4 3.001727 0.4999091
code
ggplot(longanscombe, aes(x = x, y = y)) +
  geom_point() + 
  facet_wrap(~set) +
  geom_smooth(method = "lm", se = FALSE, color="red")
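
The quartet makes its point because the usual summary statistics agree across the four sets even though the plots do not. A minimal sketch in base R, using the built-in `anscombe` data (the object name `quartet_summary` is introduced here for illustration and is not part of the original slides):

```r
# Summary statistics for each of the four Anscombe sets (base R only).
# `quartet_summary` is an illustrative name.
quartet_summary <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(set = i,
    mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y))
}))

# Rounded for display: one row per set
print(round(quartet_summary, 2))
```

The rows are nearly identical, which is why the regression estimates match; only the plots reveal that the four relationships are different.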

Data Generating Process

  • What produced the data we observe?
    • political processes, actors, etc.
    • are those actors purposeful with respect to the observed data?
    • the data collector: choices, biases, mistakes.

Data Generating Process

  • Why do we observe the data we see and not the data we don’t?
    • existence is not randomly determined.
    • research questions are usually about things that happen, not things that do not or cannot.
    • reporting itself is a political process.
    • reporting is shaped by resources.

A Terrible Map

War outcomes

Outcome Autocrat Democrat Total
Loser 42 9 51
Winner 32 38 70
Total 74 47 121

From Lake (1992), p. 31
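
One way to read the table is to compare win rates within each regime type rather than raw counts. A short sketch, with the counts above entered by hand (the object names `war` and `win_share` are illustrative):

```r
# War outcomes by regime type; counts copied from the table above (Lake 1992, p. 31)
war <- matrix(c(42,  9,
                32, 38),
              nrow = 2, byrow = TRUE,
              dimnames = list(outcome = c("Loser", "Winner"),
                              regime  = c("Autocrat", "Democrat")))

# Add row and column totals to check the counts against the table
print(addmargins(war))

# Share of wars won within each regime type (column proportions)
win_share <- prop.table(war, margin = 2)["Winner", ]
print(round(win_share, 2))
```

In these data democracies win about 81 percent of the wars they fight and autocracies about 43 percent. Whether that gap reflects how each type fights or which wars each type enters is a question about the data generating process.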

Observability

Asking the right question

Data

Collections of like units, their characteristics, features, choice sets, behaviors, etc.

  • what are the units? In what ways are they heterogeneous?
  • what units are included? Which ones are missing? Why?
  • what do the variables measure?
  • how are the variables measured?
  • what observations are missing? Why?
  • what is the sample; what is the population (sampling frame) from which the sample is drawn?
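
Several of these questions have a first-pass answer in a few lines of code. The sketch below uses `airquality`, a dataset that ships with base R, purely as a stand-in for a research dataset; the object names are illustrative.

```r
# `airquality` is a stand-in here; substitute your own data frame
dat <- airquality

n_units        <- nrow(dat)                  # how many observations (units)?
missing_by_var <- colSums(is.na(dat))        # missing values per variable
complete_share <- mean(complete.cases(dat))  # share of rows with nothing missing

print(n_units)
print(missing_by_var)
print(round(complete_share, 2))
```

Counts like these say what is missing. Why it is missing depends on how the data were collected, which the code cannot tell you.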

Types of variables

Variables are either

  • discrete - observations map to integers; all possible values are clearly distinguishable; not divisible. E.g., number of protests in DC this year; an individual’s sex; Polity score.

  • continuous - observations can take on any real value between boundaries (sometimes \(-\infty, +\infty\)); infinitely divisible. E.g., household income, GDP per capita.

Discrete variables

May be of two types or levels of measurement:

  • nominal - categories are distinct, but lack order. E.g., religion = Hindu, Muslim, Catholic, Protestant, Jewish. Binary variables are nominal, e.g., Sex = male (0), female (1); do you have blue eyes? yes (0), no (1).

  • ordinal - take on countable values, increasing/decreasing in some dimension. E.g., Polity -10, -9, \(\ldots\) 0, 1, \(\ldots\) 9, 10 increasing in democracy; survey responses “Do you feel safe traveling abroad?” Not at all; sometimes; yes, completely.
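
In R the nominal/ordinal distinction corresponds roughly to unordered versus ordered factors. A minimal sketch with made-up responses (the values and object names are illustrative):

```r
# Nominal: categories with no inherent order
religion <- factor(c("Hindu", "Muslim", "Catholic", "Hindu"))

# Ordinal: the same kind of object, but with a declared ordering of the levels
safe <- factor(c("sometimes", "not at all", "yes, completely"),
               levels  = c("not at all", "sometimes", "yes, completely"),
               ordered = TRUE)

print(is.ordered(religion))  # FALSE
print(is.ordered(safe))      # TRUE
print(safe[1] > safe[2])     # TRUE: "sometimes" ranks above "not at all"
```

Comparisons such as `>` are defined for the ordered factor only, which is the sense in which the ordinal variable carries more information.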

Continuous variables

Can be of two types (levels of measurement):

  • interval - 1 unit increase has same meaning across the scale (i.e., the intervals are the same); e.g., degrees Celsius or Fahrenheit.

  • ratio - equal intervals plus a meaningful absolute zero; e.g., weight in pounds, where zero lbs indicates the absence of weight; a Venmo balance of zero means there is actually no money; temperature in Kelvin; duration of a war in days, where zero days means there is no war.
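
A small worked example shows why the zero point matters: differences are meaningful on both scales, but ratios are meaningful only on the ratio scale. The two temperatures below are arbitrary illustrative values.

```r
# Two temperatures on an interval scale (Celsius)
celsius <- c(10, 20)
# The same temperatures on a ratio scale (Kelvin)
kelvin <- celsius + 273.15

# Differences agree across the two scales
print(all.equal(diff(celsius), diff(kelvin)))

# Ratios do not: 20 C is not "twice as hot" as 10 C
ratio_c <- celsius[2] / celsius[1]   # 2
ratio_k <- kelvin[2] / kelvin[1]     # about 1.035
print(c(celsius = ratio_c, kelvin = round(ratio_k, 3)))
```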

Levels of measurement

These four levels of measurement can be ordered by the amount of information a variable contains, from least to most:

  • nominal

  • ordinal

  • interval

  • ratio

We can convert a variable from a higher level to a lower level, though doing so sacrifices information; we cannot go in the opposite direction.
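
As a sketch of moving down a level, the code below collapses a Polity-style score into three ordered categories. The scores are made up, and the cut points (-10 to -6, -5 to 5, 6 to 10) are an assumption that follows a common convention rather than anything stated in these slides.

```r
# Hypothetical Polity scores for five countries (made-up values)
polity_score <- c(-9, -6, 0, 5, 8)

# Collapse the 21-point scale into three ordered categories.
# Assumed cut points: -10..-6 autocracy, -5..5 anocracy, 6..10 democracy.
regime <- cut(polity_score,
              breaks = c(-Inf, -6, 5, Inf),
              labels = c("autocracy", "anocracy", "democracy"),
              ordered_result = TRUE)

print(data.frame(polity_score, regime))
print(table(regime))
```

The reverse step is not available: once scores of 0 and 5 are both coded "anocracy", nothing in `regime` distinguishes them.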

Levels of Measurement and Models

In general, the level of measurement of \(y\) (that is, the type and amount of information the variable contains) shapes what type of model is appropriate.

  • discrete variables usually require statistics/models in the Binomial family (for our purposes, mostly MLE models like the Logit.)

  • continuous variables usually require statistics/models in the Normal/Gaussian family (for our purposes, mostly OLS models like the linear regression.)
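
A minimal sketch of the two cases, using the `mtcars` data that ships with base R as a stand-in (the variable choices and object names are illustrative, not taken from the course materials):

```r
# Binary outcome (am: 0 = automatic, 1 = manual): logit, estimated by maximum likelihood
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Continuous outcome (mpg): linear regression, estimated by OLS
ols_fit <- lm(mpg ~ wt, data = mtcars)

print(coef(logit_fit))  # change in the log-odds of am = 1 per unit of wt
print(coef(ols_fit))    # change in mpg per unit of wt
```

The right-hand side is the same in both models; what changes with the level of measurement of the outcome is the distributional family and, with it, how the coefficients are interpreted.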

Describe these data

code
gap <- read.csv("/Users/dave/documents/teaching/501/2024/slides/L1-data/data/gapminder.csv")

# histogram on the density scale with a kernel density overlay
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Density - Alcohol Consumption per Adult (Liters)")

code
# normal density evaluated at each observed value, using the sample mean and sd
alc <- gap$alcohol_consumption_per_adult_15plus_litres
gap$nalc <- dnorm(alc, mean = mean(alc, na.rm = TRUE), sd = sd(alc, na.rm = TRUE))

# histogram, kernel density, and the normal PDF (dashed) for comparison
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  geom_line(aes(y = nalc), linetype = "longdash", linewidth = 1) +
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Alcohol Consumption per Adult (Liters), Normal PDF")

Describe these data

Polity Scores

code
polity <- polity %>% mutate(era = ifelse(year==1980, 1980, ifelse(year==2018, 2018, 0)))

# colour and fill belong in geom_bar(); ggplot() ignores them
p <- ggplot(data = polity %>% filter(era != 0), aes(y = polity2)) +
  geom_bar(colour = "black", fill = "white") +
  geom_text(aes(label = after_stat(count)), stat = "count", hjust = -.2, colour = "black", size = 2.5,
            position = position_dodge(0.9))

# one panel per year
p + facet_wrap(~ era) +
  labs(y = "Polity score", x = "Countries", caption = "Polity project") +
  ggtitle("Polity Scores, 1980 and 2018")

Overstaying Terms

expand for full code
#polity <- read_dta("/Users/dave/Documents/2023/PITF/slides/polity5.dta")
# merge the overstay data (tl) with Polity scores by country code and year
overpolity <- left_join(tl, polity, by = c("ccode", "year"))

# successful overstay attempts at each Polity score
success <- overpolity %>%
  filter(!is.na(polity2)) %>%
  group_by(polity2) %>%
  summarise(events = sum(tl_success)) %>%
  mutate(outcome = "succeeded")

# failed overstay attempts at each Polity score
fail <- overpolity %>%
  filter(!is.na(polity2)) %>%
  group_by(polity2) %>%
  summarise(events = sum(tl_failed)) %>%
  mutate(outcome = "failed")

tlpolity <- bind_rows(success, fail)

# stacked bars: failed and succeeded attempts at each Polity score
ggplot(tlpolity, aes(fill = outcome, y = events, x = polity2)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("darkgreen", "lightgreen")) +
  labs(x = "Polity", y = "Frequency", caption = "Overstaying data (Versteeg et al. 2020)") +
  ggtitle("Overstay Attempts by Polity Score") +
  scale_x_continuous(breaks = seq(-10, 10, 1))

Thinking about data

  • what are the units? Why?
  • why are these variables in the data? Why not others?
  • what is purposely not in the data?
  • what is incidentally not in the data?
  • why were the data generated? Does that purpose fit your purpose?
  • are the data dynamic or static over time?

References

Lake, David A. 1992. “Powerful Pacifists: Democratic States and War.” American Political Science Review 86 (1): 24–37.